This document describes the replication files associated with the following paper:
Tomoki Kaneko, Taka-aki Asano, and Hirofumi Miwa. “Extracting Ideological Dimensions from Legislative Speeches in the Japanese Diet.” Social Science Japan Journal.

#### Contained files ####

## Data preprocessing

0_text_preprocessing_to_dfm.R: R script to generate year-by-year document–feature matrices from the Diet speech data

## Wordshoal

1-1_Wordfish.R: R script to fit the Wordfish models to estimate parties’ year-by-year committee-specific positions
1-2_Wordfish_bootstrap.R: R script to conduct parametric bootstrap for the Wordfish models
1-3_factor_analysis.R: R script to integrate committee-specific positions into a unidimensional dynamic ideal-point scale
1-a_wordfish_dense_modified.cpp: auxiliary C++ script for parametric bootstrap
1-b_dynamic_factor_analysis.R: JAGS script to estimate the dynamic factor analysis model

## Exploration of outputs

2_exploration.R: R script to explore the Wordshoal outputs and visualize key findings

## Validation

3-1_UTAES_analysis.R: R script to obtain the UTAES-based measures of party ideal points
3-2_validation.R: R script to compare this study’s party latent traits and the existing measures of party ideal points
3-3_legislative_behavior.R: R script to validate this study’s party latent traits against party ideal points based on legislative behavior
3-a_party_estimates.csv: data file that contains this study’s party latent trait estimates, output from 2_exploration.R
3-b_Kato_scores.csv: data file that contains Kato’s expert survey measures of party ideal points, as well as the variable correspondence between the Manifesto Project Dataset and this study’s data
3-c_UTAES-based_scores.csv: data file that contains the UTAES-based measures of party ideal points, output from 3-1_UTAES_analysis.R
3-d_UTAES_IRT_2003.R: auxiliary R script to fit the graded IRT model using the 2003 UTAS data
3-d_UTAES_IRT_2005.R: auxiliary R script to fit the graded IRT model using the 2005 UTAS data
3-d_UTAES_IRT_2009.R: auxiliary R script to fit the graded IRT model using the 2009 UTAS data
3-d_UTAES_IRT_2012.R: auxiliary R script to fit the graded IRT model using the 2012 UTAS data
3-d_UTAES_IRT_2014.R: auxiliary R script to fit the graded IRT model using the 2014 UTAS data
3-d_UTAES_IRT_2017.R: auxiliary R script to fit the graded IRT model using the 2017 UTAS data

## Additional analyses

4-1_robustness_check.R: R script to conduct the robustness check
4-2_committee_specific_positions.R: R script to visualize parties’ year-by-year committee-specific positions
4-3_top_contributing_words.R: R script to extract words that highly contribute to the estimation of committee-specific positions
4-a_dynamic_factor_analysis_RC.R: JAGS script to estimate the dynamic factor analysis model for the robustness check

#### External data ####

## Document–feature matrices

Please download the preprocessed year-by-year document–feature matrices from the following link:
https://www.dropbox.com/scl/fo/sou0zj1gi094lloce9mp4/AHZmEyEQUsX2hN2SB34Bf-0?rlkey=nzxgre2eexuxubkf1xypafkuk&dl=0

Our code assumes that these files are stored in a folder named `dfm` under the working directory.

## Intermediate outputs

To facilitate replication, we also provide intermediate output files that allow users to reproduce the analyses from later stages of the workflow. These files are particularly useful for avoiding compatibility issues related to the quanteda package and its dependencies.

Wordfish outputs
https://www.dropbox.com/scl/fo/toa0asxt4b22u9hoowz8z/AG2BAvZtcKaskRL0KbCHgbQ?rlkey=atcpwy1volb5ukm883f0lne0i&dl=0

Wordfish bootstrap outputs
https://www.dropbox.com/scl/fo/zrsoap68zsrvvpmnhnti9/AOykL8drVaFpRtFsR3oyOY0?rlkey=9jh5hyj6aidx360t6aite1wut&dl=0

Wordshoal outputs
https://www.dropbox.com/scl/fo/0d66emq2n7x5ms7vkl4xj/AE-g6AQG7IV6n4gokdZ6NqU?rlkey=y5md88jitbjhhee8nqdltvy1o&dl=0

Our code assumes that these files are stored in folders named `Wordfish`, `Wordfish_bootstrap`, and `Wordshoal`, respectively, under the working directory.

## Third-party data

To replicate the validation analyses, the following third-party datasets are required. Please place these files in the working directory.

Manifesto Project Dataset (version 2024a), distributed as `MPDataset_MPDS2024a.csv`
https://manifestoproject.wzb.eu/datasets?archived=yes

UTokyo–Asahi survey data (2003–2017; both elite and voter surveys are required)
https://www.masaki.j.u-tokyo.ac.jp/utas/utasindex_en.html

House of Representatives bill database provided by the SmartNews Media Research Institute (2014–2019), distributed as `gian.csv`  
https://github.com/smartnews-smri/house-of-representatives
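Before running the scripts, it may help to confirm that the working directory contains the folders and files named above. The following is a minimal sketch, not part of the replication files: `check_layout()` is a hypothetical helper, and the UTokyo–Asahi survey files are not checked because their file names are not specified in this document.

```r
# Folders and files named in this README, relative to the working directory.
required_dirs  <- c("dfm", "Wordfish", "Wordfish_bootstrap", "Wordshoal")
required_files <- c("MPDataset_MPDS2024a.csv", "gian.csv")

# check_layout() is a hypothetical helper, not part of the replication files.
# It returns the required paths that are missing under `root`.
check_layout <- function(root = getwd()) {
  dirs_ok  <- dir.exists(file.path(root, required_dirs))
  files_ok <- file.exists(file.path(root, required_files))
  c(required_dirs[!dirs_ok], required_files[!files_ok])
}

missing_paths <- check_layout()
if (length(missing_paths) > 0) {
  message("Missing under ", getwd(), ": ",
          paste(missing_paths, collapse = ", "))
} else {
  message("All expected folders and files were found.")
}
```

Not every item is needed for every script. For example, the third-party files are required only for the validation analyses, so a reported missing path matters only if the stage you intend to run depends on it.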

#### Notes on reproducibility ####

Our replication files do not reproduce the entire workflow underlying the results reported in the paper, for the following two reasons.

First, the preprocessing of Diet speech data relied in part on commercial datasets to identify and harmonize speakers’ party affiliations. Because these proprietary data cannot be redistributed, we are unable to publish the materials required to reproduce the preprocessing stage starting from the raw speech data. Instead, we provide preprocessed year-by-year document–feature matrices together with the R code used for all subsequent analyses. For reference, we also include code for text preprocessing that illustrates how raw texts are converted into year-by-year document–feature matrices; however, this code does not run with the data included in the published replication files. Researchers interested in the preprocessing procedure are welcome to contact the authors for further details.

Second, even conditional on these preprocessed inputs, reproducing the exact model estimates reported in the paper may not always be possible. This project was developed over a long period of time, and unfortunately we did not systematically record the precise versions of the R packages used at each stage of the analysis. This limitation is particularly relevant for quanteda and related packages, whose core functions and defaults have changed repeatedly over time and can affect both preprocessing outputs and model estimation. As a result, running the provided scripts with current package versions may yield results that differ from those reported in the paper. To mitigate this issue, we also provide intermediate outputs (e.g., Wordfish and Wordshoal estimates), which allow readers to reproduce the downstream analyses without re-estimating the models from scratch.

For reference, the Wordshoal estimation was conducted between March and April 2021, and subsequent analyses based on the Wordshoal estimates were finalized in August 2025. Users who wish to replicate the Wordshoal procedure itself may find it helpful to install versions of quanteda and related packages from around April 2021.
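One possible way to obtain package versions from around that time is to install from a dated CRAN snapshot into a separate library. The sketch below is an illustration under stated assumptions, not the procedure the authors used: it assumes that Posit Package Manager serves dated snapshots at the URL pattern shown, the snapshot date and the library folder name are arbitrary choices, and the package list is a guess that should be checked against the `library()` calls in each script.

```r
# Assumption: Posit Package Manager exposes dated CRAN snapshots under
# https://packagemanager.posit.co/cran/<YYYY-MM-DD>.
snapshot_repo <- function(date) {
  paste0("https://packagemanager.posit.co/cran/", date)
}

# install_snapshot() is a hypothetical helper, not part of the replication
# files. It installs `pkgs` as of `date` into a separate library `lib`, so
# the current R installation is left untouched.
install_snapshot <- function(pkgs, date, lib) {
  dir.create(lib, showWarnings = FALSE, recursive = TRUE)
  install.packages(pkgs, lib = lib, repos = snapshot_repo(date))
  invisible(lib)
}

# Illustrative values only: neither the date nor the package list is taken
# from the paper.
snapshot_date <- "2021-04-30"
pkgs <- c("quanteda", "quanteda.textmodels")
lib_2021 <- file.path(getwd(), "rlib_2021")

# Uncomment to run (requires network access):
# install_snapshot(pkgs, snapshot_date, lib_2021)
# .libPaths(c(lib_2021, .libPaths()))
# packageVersion("quanteda", lib.loc = lib_2021)
```

Older package versions may fail to build or load under a current version of R, so an R release from the same period may also be needed. If the installation does not succeed, the intermediate outputs described above remain the more reliable route to reproducing the downstream analyses.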

#### Contact ####

For questions regarding these files or the replication materials, please contact Hirofumi Miwa.